Introduction


Team information

Name NetID Email
Artsiom Strok astrok2
Lin Jiang linj3
Mayank Chhablani mchhab2

Description of the dataset

Life expectancy has increased dramatically over the last few centuries. Since 1900 the global average life expectancy has more than doubled and there has been a huge development in health sector in the past 15 years resulting in improvement of human mortality rates especially in the developing nations in comparison to the past 30 years. Therefore, in this project, only data from year 2000-2015 is considered for exploration and analysis.

This dataset is a compilation of data from the Global Health Observatory (GHO) and United Nations. The GHO data repository is WHO’s gateway to health-related statistics which provides access to a variety of indicators on priority health topics including mortality and burden of diseases, environmental health, violence and injuries etc. (http://apps.who.int/gho/data/node.resources). The economic data such as GDP is collected from the National Accounts Main Aggregates Database under United Nations which collects and disseminates economic statistics from countries worldwide (https://unstats.un.org/unsd/snaama/Index).

This dataset is cleaned by removing some missing values, maily for population, Hepatitis B and GDP from less known countries and shared on Kaggle website (https://www.kaggle.com/kumarajarshi/life-expectancy-who). The final dataset contains 2938 observations with 22 variables which are more critical and representative among all the categories of health-related factors from year 2000 - 2015 for 193 countries.

The description of each variable for this dataset is listed below:

Numerical Response

  • Life expectancy: Life Expectancy in age

Numerical Predictors

  • Year: Year
  • Adult Mortality: Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
  • infant: deathsNumber of Infant Deaths per 1000 population
  • Alcohol: Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
  • percentage expenditure: Expenditure on health as a percentage of Gross Domestic Product per capita(%)
  • Hepatitis B: Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
  • Measles: Measles - number of reported cases per 1000 population
  • BMI: Average Body Mass Index of entire population
  • under-five deaths: Number of under-five deaths per 1000 population
  • Polio: Polio (Pol3) immunization coverage among 1-year-olds (%)
  • Total expenditure: General government expenditure on health as a percentage of total government expenditure (%)
  • Diphtheria: Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
  • HIV/AIDS: Deaths per 1 000 live births HIV/AIDS (0-4 years)
  • GDP: Gross Domestic Product per capita (in USD)
  • Population: Population of the country
  • thinness 1-19 years: Prevalence of thinness among children and adolescents for Age 10 to 19 (% )
  • thinness 5-9 years: Prevalence of thinness among children for Age 5 to 9(%)
  • Income composition of resources: Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
  • Schooling: Number of years of Schooling(years)

Categotical Predictors

  • Country: Country
  • Status: Developed or Developing status

Background information

The GHO data repository is WHO’s gateway to health-related statistics which provides access to a variety of indicators on priority health topics including mortality and burden of diseases, environmental health, violence and injuries etc. (http://apps.who.int/gho/data/node.resources).

The economic data such as GDP is collected from the National Accounts Main Aggregates Database under United Nations which collects and disseminates economic statistics from countries worldwide (https://unstats.un.org/unsd/snaama/Index).

By combining the data from these two databases and exploring the interactions between the variables such as immunization, moratlity, education and economic factors, we expect to build a robust multiple linear regression model to predict the life expectancy more accurately.

Statement of interest

For this project, we would like to:

  • identify the major factors that affect the life expectancy
  • explore the interactions between the variables such as immunization, moratlity, education, economic factors etc.
  • discover the corelation between different factors
  • perform residual and outlier diagnostics to exclude some influential points if any
  • build a robust multiple linear regression model to predict the life expectancy
  • build a classifier to estimate the status of a country (developing or developed)

Samples of the data

library(knitr)
data = read.csv("Life Expectancy Data.csv")
kable(t(data[sample(nrow(data), 5), ]))
2508 912 2358 976 2041
Country Sweden Fiji Slovenia Gambia Poland
Year 2013 2002 2003 2002 2014
Status Developed Developing Developed Developing Developed
Life.expectancy 81.9 67.9 76.5 56.6 77.3
Adult.Mortality 57 22 119 298 12
infant.deaths 0 0 0 3 2
Alcohol 7.30 1.85 11.58 2.08 10.71
percentage.expenditure 1212.7 206.1 203.3 0.0 243.8
Hepatitis.B 67 99 NA 92 96
Measles 51 304 0 32 0
BMI 58.5 52.2 52.9 19.1 61.1
under.five.deaths 0 0 0 6 2
Polio 98 94 95 86 94
Total.expenditure 11.97 3.48 8.77 3.75 6.35
Diphtheria 98 93 95 87 98
HIV.AIDS 0.1 0.1 0.1 2.5 0.1
GDP 6283 2260 1488 NA 14342
Population 96379 815691 1995733 NA 3811735
thinness..1.19.years 1.4 4.2 2.0 9.8 1.9
thinness.5.9.years 1.3 3.9 2.1 9.8 2.1
Income.composition.of.resources 0.904 0.687 0.843 0.392 0.850
Schooling 15.8 13.3 16.1 7.0 16.4

Methods

Required packages:

  • DAAG
  • corrplot
  • ggplot2
  • magrittr
  • knitr
  • tidyr
  • GGally
  • faraway
  • lmtest
  • MASS

Data analysis:

original_data = read.csv("Life Expectancy Data.csv")
kable(t(original_data[sample(nrow(original_data), 5), ]))
1380 2304 1125 1029 2525
Country Kiribati Sierra Leone Haiti Greece Switzerland
Year 2014 2009 2013 2013 2012
Status Developing Developing Developing Developing Developed
Life.expectancy 66.1 47.1 62.7 86.0 82.7
Adult.Mortality 2 433 253 74 54
infant.deaths 0 28 14 0 0
Alcohol 0.01 3.97 5.68 7.46 9.86
percentage.expenditure 97.87 49.84 4.99 2183.11 18379.33
Hepatitis.B 75 84 68 98 NA
Measles 0 31 0 3 61
BMI 77.1 21.2 47.7 65.4 56.2
under.five.deaths 0 42 19 0 0
Polio 79 81 67 99 96
Total.expenditure 1.21 13.13 8.10 9.26 11.59
Diphtheria 75 84 68 99 96
HIV.AIDS 0.1 1.7 0.5 0.1 0.1
GDP 1684.54 394.59 81.27 21874.82 83164.39
Population 11458 63126 1431776 1965211 7996861
thinness..1.19.years 0.1 8.5 3.9 0.8 0.5
thinness.5.9.years 0.1 8.4 3.9 0.7 0.3
Income.composition.of.resources 0.597 0.375 0.483 0.860 0.932
Schooling 11.9 8.5 9.1 17.1 15.7
kable(sort(colSums(is.na(original_data)), decreasing = TRUE), col.names = "Number of missing values")
Number of missing values
Population 652
Hepatitis.B 553
GDP 448
Total.expenditure 226
Alcohol 194
Income.composition.of.resources 167
Schooling 163
BMI 34
thinness..1.19.years 34
thinness.5.9.years 34
Polio 19
Diphtheria 19
Life.expectancy 10
Adult.Mortality 10
Country 0
Year 0
Status 0
infant.deaths 0
percentage.expenditure 0
Measles 0
under.five.deaths 0
HIV.AIDS 0

1289 samples have at least one missing value. Alcohol is missing in 194 samples all of them belongs to 2015 year and these countiries definately should have alcohol consuption more than 0. The data was collected in 2015 when data about alcohol consumption simply was not available.

Life expextancy and adult mortality is missing for 10 samples in 2013, all of them belongs to islands.

Hepatitis B is missing in 553 samples. Samples belongs to different countries and years.

nrow(original_data)
## [1] 2938
data = na.omit(original_data)
nrow(original_data) - nrow(data)
## [1] 1289
data = data[-1] #exclude country from dataset
kable(t(do.call(cbind, lapply(data, summary))))
Min. 1st Qu. Median Mean 3rd Qu. Max.
Year 2000.000 2005.000 2.008e+03 2.008e+03 2.011e+03 2.015e+03
Status 242.000 1407.000 2.420e+02 1.407e+03 2.420e+02 1.407e+03
Life.expectancy 44.000 64.400 7.170e+01 6.930e+01 7.500e+01 8.900e+01
Adult.Mortality 1.000 77.000 1.480e+02 1.682e+02 2.270e+02 7.230e+02
infant.deaths 0.000 1.000 3.000e+00 3.255e+01 2.200e+01 1.600e+03
Alcohol 0.010 0.810 3.790e+00 4.533e+00 7.340e+00 1.787e+01
percentage.expenditure 0.000 37.439 1.451e+02 6.990e+02 5.094e+02 1.896e+04
Hepatitis.B 2.000 74.000 8.900e+01 7.922e+01 9.600e+01 9.900e+01
Measles 0.000 0.000 1.500e+01 2.224e+03 3.730e+02 1.314e+05
BMI 2.000 19.500 4.370e+01 3.813e+01 5.580e+01 7.710e+01
under.five.deaths 0.000 1.000 4.000e+00 4.422e+01 2.900e+01 2.100e+03
Polio 3.000 81.000 9.300e+01 8.356e+01 9.700e+01 9.900e+01
Total.expenditure 0.740 4.410 5.840e+00 5.956e+00 7.470e+00 1.439e+01
Diphtheria 2.000 82.000 9.200e+01 8.416e+01 9.700e+01 9.900e+01
HIV.AIDS 0.100 0.100 1.000e-01 1.984e+00 7.000e-01 5.060e+01
GDP 1.681 462.150 1.593e+03 5.566e+03 4.719e+03 1.192e+05
Population 34.000 191897.000 1.420e+06 1.465e+07 7.659e+06 1.294e+09
thinness..1.19.years 0.100 1.600 3.000e+00 4.851e+00 7.100e+00 2.720e+01
thinness.5.9.years 0.100 1.700 3.200e+00 4.908e+00 7.100e+00 2.820e+01
Income.composition.of.resources 0.000 0.509 6.730e-01 6.316e-01 7.510e-01 9.360e-01
Schooling 4.200 10.300 1.230e+01 1.212e+01 1.400e+01 2.070e+01

Looking at the summary data we can already see some inconsistencies. In Infant Deaths we see that the max value listed is 1600 which doesn’t make sense since we’re working with per 1000 population data. The same or similar numbers we can see for Infant deaths, Measles, Under five deaths

boxplot(data$Adult.Mortality)

boxplot(data$infant.deaths)

boxplot(data$Measles)

boxplot(data$under.five.deaths)

This original dataset contains Country variable which we would not use for model building. Therefore, this column is excluded from the following data analysis as well as the records with NA in some of the columns mentioned above. This will reduce the size of the original dataset from 2938 to 1649 rows, which captures majority of the information and allows the speed of modeling to be more efficient.

Data Visualization

Let’s take a look on histograms for each attribute in the dataset.

ggplot(gather(data[-2]), aes(value)) + 
    geom_histogram(bins = 10) + 
    facet_wrap(~key, scales = 'free_x')

Let’s take a look of the dataset using plots of Life Expectancy vs Status or Year. The boxplot indicates that there is significant difference in Life Expectancy between the Developed and Developing countries. As expected, the Life Expectancy increases as the Year passes by and the Violin plot shows that the data is well distributed across different years.

par(mfrow = c(1, 2))
# Histogram of Life Expectancy
hist(
  data$Life.expectancy,
  xlab = "Life Expectancy",
  main = "Distribution of Life Expectancy",
  col = "dodgerblue",
  breaks = 25
)
# Boxplot of Life Expectancy vs Country Status (Developed vs Developing)
plot(
  data$Status,
  data$Life.expectancy,
  xlab = "Status",
  ylab = "Life Expectancy",
  main = "Life Expectancy vs. Status",
  col = c(2, 3)
)

# Violin plot of Life Expectancy vs. Year
data %>% ggplot() + geom_violin(aes(
  x = Year,
  y = Life.expectancy,
  group = Year,
  fill = Year
)) + labs(title = "Life Expectancy vs. Year")

Colinearity

Colinearity issue is visualized using the following plots. Resulsts show that some predictors have strong collinearity issues, such as infant.deaths vs. under.five.deaths, GDP vs. percentage.expenditure, Population vs. thinness..1.19.years etc.

corrplot(
  cor(data[-c(2)]),
  method = "color",
  order = "hclust",
  type = "lower",
  outline = TRUE,
  tl.col = "black",
  addCoef.col = "darkgreen",
)

ggpairs(data[-c(2)])

Train test split

# train test split 70/30 hold out
train_size = floor(0.7 * nrow(data))
train_idx = sample(nrow(data), train_size)
data.train = data[train_idx, ]
data.test = data[-train_idx, ]

Model Building

Steps to generate best model:

We will start with simple additive model and then use AIC/ BIC to weed out unnecessary predictors. We then, may try out interactions and see if that improves our model and later, we may use response or, predictor transformations to improve our selected model.

Additve Model with all predictors

model.additive = lm(Life.expectancy ~ ., data.train)
#summary(model.additive)
par(mfrow = c(2, 2))
plot(model.additive)

So, looking at Residual vs Fitted plot that we may have issue of linearity and equal-variance. Similary, we need to check for normality assumption as dictated by Normal Q-Q plot

Let’s check for Homoscedasticityand Normality assumptions.

show_metrics(model.additive)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
3.19638743734941e-21 0.000059802581149761 0.844706587093747 3.50866876426059 11.8303259589542 14.4884239442948

So, from above data it looks liek although Adjusted R^2 and RMSE is better, but, we REJECT both Homoscedasticityand Normality assumptions. Also, from summary of model it seems there is need to prune out some predictors.

Let’s apply AIC and BIC to see if we can improve our model

model.additive.step.aic = step(model.additive, trace = FALSE)

par(mfrow = c(2, 2))
plot(model.additive.step.aic)

So, looking at AIC model, we can see that GDPvs percentage.expenditure collinearity issue has been removed by AIC. Also, some of the non-significant predictors are also removed.

show_metrics(model.additive.step.aic)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
2.61540494570909e-22 0.0000318962005371406 0.84483096513633 3.50080398499081 11.8834501668962 14.4433330629638

Apart from reduction in predictors, there isn’t a significant improvement. Let’s try out BIC:

model.additive.step.bic = step(model.additive, k = log(nrow(data.train)), trace = FALSE)

par(mfrow = c(2, 2))
plot(model.additive.step.bic)

show_metrics(model.additive.step.bic)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
2.12322671490772e-20 0.0000133651092912547 0.842918941917074 3.51818636644947 12.0721275312972 14.5125336442044

So, BIC and AIC resutls are almost same and so does the evaluatin metric but, we saw two improvements: (1) Reduction in number of predictors (2) Reduction in influential points as per the Residual vs Leverage graph.

Hence, let’s select AIC model and try interactions. The BIC is little aggressive in removing predictors so, we are selecting AIC additive model.

Also, let’s check for variance-inflation-factor

vif_aic_all_model = vif(model.additive.step.aic)
vif_aic_all_model[vif_aic_all_model > 5]
##     infant.deaths under.five.deaths 
##             225.2             220.1

As evident from above VIF computation, infant.deaths and under.five.deaths in the model are causing collinearity issue. This same issue we found during our correlation plot. Let’s try to improve the model and remove issues which are violating linearity assumptions.

Applying Two-Way Interaction:

model.interaction = lm(Life.expectancy ~ . ^ 2, data = data.train)

Since, this will explode the number of predictors let’s generate AIC and BIC models for this interaction model.

AIC Interaction Model:

model.interaction.step.aic = step(model.interaction, trace = 0)

par(mfrow = c(2, 2))
plot(model.interaction.step.aic)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

show_metrics(model.interaction.step.aic)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.00598315752593769 5.02438486442735e-10 0.918642514611818 2.89436801651001 5.59611868899128 15.3358935315704

AIC and BIC model are again somewhat similar. Total numbr of predictors used by AIC is 119 and we can see that with interactions we have made improvement in adjusted r_squared and bptest. Let’s see the VIF imapct

vif_interaction_model = vif(model.interaction.step.aic)
vif_interaction_model[vif_interaction_model > 5]
##                                                   Year 
##                                              9.463e+01 
##                                       StatusDeveloping 
##                                              1.206e+03 
##                                        Adult.Mortality 
##                                              7.079e+05 
##                                          infant.deaths 
##                                              3.197e+08 
##                                                Alcohol 
##                                              1.245e+02 
##                                 percentage.expenditure 
##                                              8.236e+06 
##                                            Hepatitis.B 
##                                              7.088e+01 
##                                                Measles 
##                                              8.146e+02 
##                                                    BMI 
##                                              1.094e+02 
##                                      under.five.deaths 
##                                              3.528e+08 
##                                                  Polio 
##                                              6.172e+01 
##                                      Total.expenditure 
##                                              4.558e+05 
##                                             Diphtheria 
##                                              1.709e+02 
##                                               HIV.AIDS 
##                                              1.108e+06 
##                                                    GDP 
##                                              8.047e+06 
##                                             Population 
##                                              3.971e+06 
##                                   thinness..1.19.years 
##                                              9.647e+06 
##                                     thinness.5.9.years 
##                                              9.988e+06 
##                        Income.composition.of.resources 
##                                              1.820e+06 
##                                              Schooling 
##                                              1.380e+06 
##                                   Year:Adult.Mortality 
##                                              7.089e+05 
##                                     Year:infant.deaths 
##                                              3.238e+08 
##                            Year:percentage.expenditure 
##                                              8.243e+06 
##                                 Year:under.five.deaths 
##                                              3.577e+08 
##                                 Year:Total.expenditure 
##                                              4.577e+05 
##                                          Year:HIV.AIDS 
##                                              1.107e+06 
##                                               Year:GDP 
##                                              8.075e+06 
##                                        Year:Population 
##                                              4.022e+06 
##                              Year:thinness..1.19.years 
##                                              9.691e+06 
##                                Year:thinness.5.9.years 
##                                              1.003e+07 
##                   Year:Income.composition.of.resources 
##                                              1.825e+06 
##                                         Year:Schooling 
##                                              1.389e+06 
##                       StatusDeveloping:Adult.Mortality 
##                                              8.924e+01 
##                         StatusDeveloping:infant.deaths 
##                                              1.800e+06 
##                           StatusDeveloping:Hepatitis.B 
##                                              6.603e+01 
##                     StatusDeveloping:under.five.deaths 
##                                              2.288e+06 
##                     StatusDeveloping:Total.expenditure 
##                                              2.777e+01 
##                            StatusDeveloping:Diphtheria 
##                                              4.078e+02 
##                  StatusDeveloping:thinness..1.19.years 
##                                              3.142e+04 
##                    StatusDeveloping:thinness.5.9.years 
##                                              2.619e+04 
##       StatusDeveloping:Income.composition.of.resources 
##                                              1.092e+03 
##                             StatusDeveloping:Schooling 
##                                              2.695e+02 
##                 Adult.Mortality:percentage.expenditure 
##                                              4.291e+01 
##                                    Adult.Mortality:BMI 
##                                              1.228e+01 
##                      Adult.Mortality:under.five.deaths 
##                                              4.157e+01 
##                               Adult.Mortality:HIV.AIDS 
##                                              1.225e+01 
##                                    Adult.Mortality:GDP 
##                                              4.236e+01 
##                             Adult.Mortality:Population 
##                                              2.845e+01 
##        Adult.Mortality:Income.composition.of.resources 
##                                              5.368e+01 
##                              Adult.Mortality:Schooling 
##                                              9.330e+01 
##                                  infant.deaths:Alcohol 
##                                              3.618e+01 
##                   infant.deaths:percentage.expenditure 
##                                              8.951e+02 
##                              infant.deaths:Hepatitis.B 
##                                              5.134e+03 
##                                      infant.deaths:BMI 
##                                              1.648e+03 
##                        infant.deaths:under.five.deaths 
##                                              4.808e+02 
##                                    infant.deaths:Polio 
##                                              2.376e+02 
##                               infant.deaths:Diphtheria 
##                                              2.092e+04 
##                                 infant.deaths:HIV.AIDS 
##                                              1.872e+01 
##                                      infant.deaths:GDP 
##                                              1.577e+03 
##                               infant.deaths:Population 
##                                              7.406e+04 
##                     infant.deaths:thinness..1.19.years 
##                                              1.385e+04 
##          infant.deaths:Income.composition.of.resources 
##                                              4.829e+03 
##                                    Alcohol:Hepatitis.B 
##                                              4.749e+01 
##                              Alcohol:Total.expenditure 
##                                              3.343e+01 
##                                     Alcohol:Diphtheria 
##                                              1.126e+02 
##                                       Alcohol:HIV.AIDS 
##                                              1.597e+01 
##                                     Alcohol:Population 
##                                              2.349e+01 
##                           Alcohol:thinness..1.19.years 
##                                              1.078e+01 
##                     percentage.expenditure:Hepatitis.B 
##                                              2.800e+01 
##               percentage.expenditure:under.five.deaths 
##                                              1.175e+03 
##                      percentage.expenditure:Diphtheria 
##                                              4.224e+02 
##                             percentage.expenditure:GDP 
##                                              2.275e+01 
## percentage.expenditure:Income.composition.of.resources 
##                                              2.767e+03 
##                       percentage.expenditure:Schooling 
##                                              2.626e+03 
##                                    Hepatitis.B:Measles 
##                                              6.286e+01 
##                          Hepatitis.B:under.five.deaths 
##                                              5.359e+03 
##                          Hepatitis.B:Total.expenditure 
##                                              3.663e+01 
##                                 Hepatitis.B:Diphtheria 
##                                              2.463e+01 
##                                 Hepatitis.B:Population 
##                                              8.144e+01 
##                       Hepatitis.B:thinness..1.19.years 
##                                              5.593e+02 
##                         Hepatitis.B:thinness.5.9.years 
##                                              5.507e+02 
##                                            Measles:BMI 
##                                              1.992e+01 
##                                          Measles:Polio 
##                                              2.424e+02 
##                              Measles:Total.expenditure 
##                                              3.579e+01 
##                                     Measles:Diphtheria 
##                                              3.718e+02 
##                                       Measles:HIV.AIDS 
##                                              9.406e+00 
##                                            Measles:GDP 
##                                              5.379e+00 
##                                     Measles:Population 
##                                              1.781e+02 
##                             Measles:thinness.5.9.years 
##                                              2.053e+02 
##                Measles:Income.composition.of.resources 
##                                              3.891e+02 
##                                  BMI:under.five.deaths 
##                                              1.656e+03 
##                                                BMI:GDP 
##                                              1.799e+01 
##                                         BMI:Population 
##                                              4.187e+01 
##                               BMI:thinness..1.19.years 
##                                              1.681e+02 
##                                 BMI:thinness.5.9.years 
##                                              1.615e+02 
##                    BMI:Income.composition.of.resources 
##                                              1.320e+02 
##                                          BMI:Schooling 
##                                              1.883e+02 
##                           under.five.deaths:Diphtheria 
##                                              1.845e+04 
##                                  under.five.deaths:GDP 
##                                              1.908e+03 
##                           under.five.deaths:Population 
##                                              6.835e+04 
##                 under.five.deaths:thinness..1.19.years 
##                                              1.394e+04 
##                   under.five.deaths:thinness.5.9.years 
##                                              9.257e+02 
##                            under.five.deaths:Schooling 
##                                              1.231e+03 
##                                Polio:Total.expenditure 
##                                              5.683e+01 
##                                              Polio:GDP 
##                                              1.340e+02 
##                             Polio:thinness..1.19.years 
##                                              2.202e+02 
##                               Polio:thinness.5.9.years 
##                                              2.050e+02 
##                  Polio:Income.composition.of.resources 
##                                              1.081e+02 
##                           Total.expenditure:Diphtheria 
##                                              7.080e+01 
##                             Total.expenditure:HIV.AIDS 
##                                              3.156e+01 
##                                  Total.expenditure:GDP 
##                                              1.741e+01 
##                           Total.expenditure:Population 
##                                              8.373e+01 
##                 Total.expenditure:thinness..1.19.years 
##                                              1.884e+01 
##      Total.expenditure:Income.composition.of.resources 
##                                              6.184e+01 
##             Diphtheria:Income.composition.of.resources 
##                                              1.361e+02 
##                                   Diphtheria:Schooling 
##                                              1.910e+02 
##                                    HIV.AIDS:Population 
##                                              5.303e+00 
##                          HIV.AIDS:thinness..1.19.years 
##                                              1.491e+02 
##                            HIV.AIDS:thinness.5.9.years 
##                                              1.469e+02 
##                                 GDP:thinness.5.9.years 
##                                              6.555e+00 
##                                          GDP:Schooling 
##                                              2.859e+03 
##                          Population:thinness.5.9.years 
##                                              4.160e+02 
##                                   Population:Schooling 
##                                              7.387e+02 
##                thinness..1.19.years:thinness.5.9.years 
##                                              6.667e+01 
##   thinness..1.19.years:Income.composition.of.resources 
##                                              4.919e+03 
##                         thinness..1.19.years:Schooling 
##                                              3.574e+03 
##     thinness.5.9.years:Income.composition.of.resources 
##                                              4.938e+03 
##                           thinness.5.9.years:Schooling 
##                                              3.794e+03

This interaction model is getting overly complicated. Let’s inspect pairs scatter plot.

pairs(data.frame(data$Life.expectancy, data$Year, data$Status, data$Adult.Mortality, data$infant.deaths, data$Alcohol, data$percentage.expenditure, data$BMI, data$under.five.deaths, data$Polio, data$Total.expenditure, data$HIV.AIDS, data$thinness.5.9.years, data$Income.composition.of.resources, data$Schooling))

Based on this cross-predictor plots, we confirm that (a) HIV.AIDS has non-linear relationship with Life.expectancy' and, (B)HIV.AIDSandAdult.Mortalityare also interacting non-linearly. Also, we may condsider removingYear` because it may divert model in different direction because of its chaotic relationship with the response.

Check for impact of high VIF predictors.

model_selected_wo_underfive = lm(
  Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling,
  data = data.train
)

model_selected_wo_infant.deaths = lm(
  Life.expectancy ~ Year + Status + Adult.Mortality + under.five.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling,
  data = data.train
)

model_selected_underfive = lm(
  under.five.deaths ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI  + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling,
  data = data.train
)

model_selected_infant.deaths = lm(
  infant.deaths ~ Year + Status + Adult.Mortality + under.five.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling,
  data = data.train
)

cor(resid(model_selected_underfive),
    resid(model_selected_wo_underfive))
## [1] -0.2527
cor(resid(model_selected_infant.deaths),
    resid(model_selected_wo_infant.deaths))
## [1] 0.2487

Based on this we can get rid of under.five.deaths

model_selected = lm(
  Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI   + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling ,
  data = data.train
)
par(mfrow = c(2, 2))
plot(model_selected)

show_metrics(model_selected)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
5.59572259383389e-29 0.0000817238216404277 0.833747060667492 3.62603598753409 12.7434773099163 14.5659873637219
vif(model_selected)
##                            Year                StatusDeveloping 
##                           1.121                           1.802 
##                 Adult.Mortality                   infant.deaths 
##                           1.837                           1.392 
##                         Alcohol          percentage.expenditure 
##                           2.246                           1.418 
##                             BMI                           Polio 
##                           1.784                           1.199 
##               Total.expenditure                        HIV.AIDS 
##                           1.124                           1.576 
##              thinness.5.9.years Income.composition.of.resources 
##                           1.996                           3.263 
##                       Schooling 
##                           3.679

Looks like none of our predictor is causing inflation in variance. Let’s consider interaction of these predictors.

model_selected_interactions = lm(
  Life.expectancy ~ (
    Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI   + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling
  ) ^ 2 ,
  data = data.train
)
par(mfrow = c(2, 2))
plot(model_selected_interactions)

show_metrics(model_selected_interactions)
## Warning in predict.lm(model, data.train): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(model, data.test): prediction from a rank-deficient fit
## may be misleading
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
4.84360148121369e-06 3.58602549267733e-09 0.89981396499582 2.96815448606524 7.16067923906753 10.356646923468
model_selected_interactions_aic = step(model_selected_interactions, trace = FALSE)
#summary(model_selected_interactions_aic)
show_metrics(model_selected_interactions_aic)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
1.44805155269171e-06 1.04271639578767e-09 0.901025514701528 2.87016492853926 7.27373007030712 10.5189876384089
model_selected_interactions_bic=step(model_selected_interactions, trace=0, k = log(nrow(data.train)))
#summary(model_selected_interactions_bic)

show_metrics(model_selected_interactions_bic)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
1.09684726187374e-08 2.12464993566629e-10 0.896246382310163 2.9003927388848 7.77145266043916 10.7769449544755
par(mfrow=c(2,2))
plot(model_selected_interactions_bic)

model_selected_poly_interactions=lm(Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI   + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling + infant.deaths:Polio +  Adult.Mortality:HIV.AIDS + I(HIV.AIDS^2) , data=data.train)
par(mfrow = c(2, 2))
plot(model_selected_poly_interactions)

show_metrics(model_selected_poly_interactions)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
9.90771175086473e-12 7.59134585416287e-06 0.85857538224716 3.34510211995818 10.811830900963 12.8744470461626
model_selected_poly_interactions=lm(Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI   + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS +I(HIV.AIDS^2) , data=data.train)
par(mfrow = c(2, 2))
plot(model_selected_poly_interactions)

show_metrics(model_selected_poly_interactions)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
3.22341712873213e-11 7.76432430757746e-06 0.857347354152492 3.35556741522419 10.915304600423 12.7046394784675
vif_poly = vif(model_selected_poly_interactions)
vif_poly[vif_poly > 5]
##                 HIV.AIDS            I(HIV.AIDS^2) Adult.Mortality:HIV.AIDS 
##                   13.817                    9.387                    6.205
length(coef(model_selected_poly_interactions))
## [1] 16
model_selected_poly_interactions=lm(Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI   + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS +I(HIV.AIDS^2) , data=data.train)
model_selected_poly_interactions_aic=step(model_selected_poly_interactions, trace=0, k=log(nrow(data)))
par(mfrow = c(2, 2))
plot(model_selected_poly_interactions_aic)

show_metrics(model_selected_poly_interactions_aic)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
1.30367180064734e-08 2.64023093149332e-06 0.855313070446564 3.37128342500836 11.1196033881997 13.0238764269319
vif_poly = vif(model_selected_poly_interactions_aic)
vif_poly[vif_poly > 5]
##                 HIV.AIDS            I(HIV.AIDS^2) Adult.Mortality:HIV.AIDS 
##                   13.115                    9.183                    6.023
length(coef(model_selected_poly_interactions_aic))
## [1] 11
model_selected_outliers = lm(
  Life.expectancy ~  Adult.Mortality  +
    infant.deaths + Polio + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources  +
    Schooling  + Adult.Mortality:HIV.AIDS + I(HIV.AIDS ^ 2),
  data = data.train,
  subset = abs(rstandard(model_selected_poly_interactions_aic)) < 2
)

par(mfrow = c(2, 2))
plot(model_selected_outliers)

show_metrics(model_selected_outliers)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
1.98371246826394e-22 0.0785951289988709 0.902757127129399 2.69776680411486 11.6767587462108 14.2696356612668

Based on above plots, we can say that we have achieved a good model based with 10 predictors and the formula is:

formula = Life.expectancy ~ Adult.Mortality + infant.deaths + Polio + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS + I(HIV.AIDS^2)

Also, looking at the model it seems, we are satisfying assumptions of linear model.

Evaluation metric are as follows:

show_metrics(model_selected_outliers)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
1.98371246826394e-22 0.0785951289988709 0.902757127129399 2.69776680411486 11.6767587462108 14.2696356612668

Box-Cox confirms that it is not necessary to transform y variable.

# Box-Cox Transformation
par(mfrow = c(1, 2))
boxcox(model_selected_outliers, plotit = TRUE)
boxcox(model_selected_outliers,
       plotit = TRUE,
       lambda = seq(0.5, 1.5, by = 0.1))

Explore the relationship between Life.expectancy and each predictor.

plot(Life.expectancy ~ Adult.Mortality, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs Adult.Mortality")

plot(Life.expectancy ~ infant.deaths, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs infant.deaths")

plot(Life.expectancy ~ Polio, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs Polio")

plot(Life.expectancy ~ HIV.AIDS, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs HIV.AIDS")

plot(Life.expectancy ~ thinness.5.9.years, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs thinness.5.9.years")

plot(Life.expectancy ~ Income.composition.of.resources, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs Income.composition.of.resources")

plot(Life.expectancy ~ Schooling, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs Schooling")

lm_le_am = lm(Life.expectancy ~ Adult.Mortality, data = data.train)
show_metrics(lm_le_am)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
1.93888870084312e-18 5.44652981244873e-27 0.484496184454426 6.32984080904256 39.9298914485206 37.4349744962051
lm_le_am_log = lm(Life.expectancy ~ log(Adult.Mortality), data = data.train)
show_metrics(lm_le_am_log)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.0000109121008769902 4.88198600292786e-17 0.183382442308834 7.97190730155398 63.2535578792075 61.5102146650863
# bp.test result is significantly improved with log transformation of Adult.Mortality.
lm_le_am_poly = lm(Life.expectancy ~ Adult.Mortality + I(Adult.Mortality ^ 2), data = data.train)
show_metrics(lm_le_am_poly)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
3.17932361175982e-25 2.76151917851856e-24 0.499632136383793 6.24283249651564 38.7238473195189 36.0427661597144
# Polynomial transformation doesn't work.
lm_le_id = lm(Life.expectancy ~ infant.deaths, data = data.train)
show_metrics(lm_le_id)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.02657747833814 8.27589915106478e-17 0.0284300058581662 8.69566904890079 75.2558627711913 74.8725034661356
# lm_le_id_log = lm(Life.expectancy ~ log(infant.deaths), data = data.train)
# couldn't perform log transformation because of O values.

lm_le_id_poly = lm(Life.expectancy ~ infant.deaths + I(infant.deaths ^ 2), data = data.train)
show_metrics(lm_le_id_poly)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.000298263106736209 2.51398765428739e-15 0.0893880358135724 8.42415869521536 70.4729484696273 71.4095417460136
# Polynomial transformation of infant.deaths doesn't improve.
lm_le_po = lm(Life.expectancy ~ Polio, data = data.train)
show_metrics(lm_le_po)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
2.66387530899153e-06 1.15592951036877e-13 0.119174477527461 8.27688462546276 68.2269780296235 71.0842938162934
lm_le_po_log = lm(Life.expectancy ~ log(Polio), data = data.train)
show_metrics(lm_le_po_log)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.0318322360510644 9.61459294344335e-16 0.0571717899249146 8.56104117786634 73.0295818335663 74.7811670558807
# bp.test result is improved with log transformation of Polio.
lm_le_po_poly = lm(Life.expectancy ~ Polio + I(Polio ^ 2), data = data.train)
show_metrics(lm_le_po_poly)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.0200803369616035 6.03245675563947e-12 0.235744603268793 7.71173926979946 59.1463030464273 60.9914024560467
# Polynomial transformation also helps.
lm_le_hiv = lm(Life.expectancy ~ HIV.AIDS, data = data.train)
show_metrics(lm_le_hiv)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
1.33841446544784e-10 7.77718772625768e-07 0.362320502385172 7.04958025653526 49.3933746964793 52.1499460150461
lm_le_hiv_log = lm(Life.expectancy ~ log(HIV.AIDS), data = data.train)
show_metrics(lm_le_hiv_log)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.00398413313373982 5.03204393449706e-06 0.644648154468367 5.25460096055283 27.5248411170202 29.573112720294
# bp.test result is significantly improved with log transformation of HIV.AIDS.
lm_le_hiv_poly = lm(Life.expectancy ~ HIV.AIDS + I(HIV.AIDS ^ 2), data = data.train)
show_metrics(lm_le_hiv_poly)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.0468046150825987 1.55206717595535e-06 0.487029498855961 6.33118333697985 39.6991749673099 43.4723916195385
# Polynomial transformation also helps.
lm_le_hiv_log_poly = lm(Life.expectancy ~ log(HIV.AIDS) + I(HIV.AIDS ^ 2), data = data.train) # doesn't further improve
show_metrics(lm_le_hiv_log_poly)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.00686870383790006 5.87198474501198e-06 0.644741393902713 5.25324248579441 27.4937321554455 29.5878606382631
lm_le_thn = lm(Life.expectancy ~ thinness.5.9.years, data = data.train)
show_metrics(lm_le_thn)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
1.6062500515926e-20 1.91144004921282e-15 0.21336005686777 7.82538800946802 60.9315519907407 61.6658609753975
lm_le_thn_log = lm(Life.expectancy ~ log(thinness.5.9.years), data = data.train)
show_metrics(lm_le_thn_log)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.289116205119675 4.52200164957121e-21 0.316302351683306 7.29420952049337 52.9578483371689 57.4977265421249
# bp.test result is significantly improved with log transformation.
lm_le_thn_poly = lm(Life.expectancy ~ thinness.5.9.years + I(thinness.5.9.years ^ 2),
                    data = data.train)
show_metrics(lm_le_thn_poly)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
9.56992897947193e-12 1.33708429290685e-12 0.362390018182186 7.04917200577612 49.3451186230709 51.7643974324823
# Polynomial transformation doesn't help a lot.
lm_le_icr = lm(Life.expectancy ~ Income.composition.of.resources, data = data.train)
show_metrics(lm_le_icr)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
5.39868610150089e-72 9.78699333659299e-26 0.559087343704458 5.87166945958373 34.1522098833106 45.020683088903
# lm_le_icr_log = lm(Life.expectancy ~ log(Income.composition.of.resources), data = data.train) couldn't perform log tranformation due to 0 values.

# bp.test result is significantly improved with log transformation.
lm_le_icr_poly = lm(
  Life.expectancy ~ Income.composition.of.resources + I(Income.composition.of.resources ^ 2),
  data = data.train
)
show_metrics(lm_le_icr_poly)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
1.37401259204068e-21 1.9291859350126e-13 0.69484854321261 4.87547569854632 23.6159019817213 24.7461223228291
# Polynomial transformation helps a little.
lm_le_sch = lm(Life.expectancy ~ Schooling, data = data.train)
show_metrics(lm_le_sch)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
2.0127149476637e-09 2.6063913651081e-19 0.532073853020822 6.03052121613411 36.2446206824462 36.7431166111802
lm_le_sch_log = lm(Life.expectancy ~ log(Schooling), data = data.train)
show_metrics(lm_le_sch_log)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
5.3272402214625e-15 9.94327190004781e-20 0.498226490690044 6.24724276659015 38.8663694705834 39.1081458945406
# log transformation doesn't work.
lm_le_sch_poly = lm(Life.expectancy ~ Schooling + I(Schooling ^ 2), data = data.train)
show_metrics(lm_le_sch_poly)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
4.23704752476234e-08 2.84765808491887e-19 0.531961805515695 6.03435983226363 36.2218297792506 36.7111420288311
# Polynomial transformation doesn't help very much.
# New model based on the transformation result
model_trans = lm(Life.expectancy ~ log(Adult.Mortality) + infant.deaths + Polio  + I(Polio ^ 2) + HIV.AIDS + I(HIV.AIDS^2) + log(thinness.5.9.years) + Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS, data = data.train)
show_metrics(model_selected_outliers)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
1.98371246826394e-22 0.0785951289988709 0.902757127129399 2.69776680411486 11.6767587462108 14.2696356612668
show_metrics(model_trans)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
8.40002629693026e-11 3.43722742650363e-07 0.836954948860762 3.58210034060524 12.530477415429 15.8469601746367
# After transformation, it shows that bptest and shapiro test results improves a little.

Results

MODEL COMPARISON + BEST MODEL + TRAIN vs TEST STATS

Discussion

Ideas for furhter improvemnts

Appendix

Put all not important stuff here